Welcome to my E-Commerce Customer Behaviour ML Project! This project observes and predicts customer behaviour in e-commerce online shopping using several ML tasks. Using a consumer behaviour dataset from Kaggle, we will discover the most influential factors and develop robust models to enhance engagement.
In this notebook, we will explore data pre-processing, data analysis, feature engineering, and model training. We will also examine customer behaviour with different machine learning models and identify the best-performing one.
GitHub Repository: GitHub Repository Link
Consumer satisfaction is an important factor for every successful business. Understanding consumer behaviour and habits is required to grow business opportunities: an unsatisfied consumer stops purchasing, which has a negative impact on the company.
By observing customer satisfaction, businesses can:
- improve customer retention,
- focus marketing on the right segments,
- strengthen relationships through tailored offers and discounts.
import pandas as pd
import plotly.express as pltx
import seaborn as sns
import numpy as np
dff = pd.read_csv("../Data/customer_dataset.csv")
df=dff.copy()
df.head()
| | Customer_ID | Gender | Age | City | Membership_Type | Total_Spend | Items_Purchased | Average_Rating | Discount_Applied | Days_Since_Last_Purchase | Satisfaction_Level |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 101 | Female | 29 | New York | Gold | 1120.20 | 14 | 4.6 | True | 25 | Satisfied |
| 1 | 102 | Male | 34 | Los Angeles | Silver | 780.50 | 11 | 4.1 | False | 18 | Neutral |
| 2 | 103 | Female | 43 | Chicago | Bronze | 510.75 | 9 | 3.4 | True | 42 | Unsatisfied |
| 3 | 104 | Male | 30 | San Francisco | Gold | 1480.30 | 19 | 4.7 | False | 12 | Satisfied |
| 4 | 105 | Male | 27 | Miami | Bronze | 720.40 | 13 | 4.0 | True | 55 | Unsatisfied |
Observation : The dataset contains customer details along with days since last purchase, spending, membership type, and satisfaction level.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Customer_ID               350 non-null    int64
 1   Gender                    350 non-null    object
 2   Age                       350 non-null    int64
 3   City                      350 non-null    object
 4   Membership_Type           350 non-null    object
 5   Total_Spend               350 non-null    float64
 6   Items_Purchased           350 non-null    int64
 7   Average_Rating            350 non-null    float64
 8   Discount_Applied          350 non-null    bool
 9   Days_Since_Last_Purchase  350 non-null    int64
 10  Satisfaction_Level        350 non-null    object
dtypes: bool(1), float64(2), int64(4), object(4)
memory usage: 27.8+ KB
df.dtypes
Customer_ID                   int64
Gender                       object
Age                           int64
City                         object
Membership_Type              object
Total_Spend                 float64
Items_Purchased               int64
Average_Rating              float64
Discount_Applied               bool
Days_Since_Last_Purchase      int64
Satisfaction_Level           object
dtype: object
Observation : This lists the data type of each column.
df.isnull().sum()
Customer_ID                 0
Gender                      0
Age                         0
City                        0
Membership_Type             0
Total_Spend                 0
Items_Purchased             0
Average_Rating              0
Discount_Applied            0
Days_Since_Last_Purchase    0
Satisfaction_Level          0
dtype: int64
Observation : There are no null values in the dataset.
df.describe()
| | Customer_ID | Age | Total_Spend | Items_Purchased | Average_Rating | Days_Since_Last_Purchase |
|---|---|---|---|---|---|---|
| count | 350.000000 | 350.000000 | 350.000000 | 350.000000 | 350.000000 | 350.000000 |
| mean | 275.500000 | 33.597143 | 845.381714 | 12.600000 | 4.019143 | 26.588571 |
| std | 101.180532 | 4.870882 | 362.058695 | 4.155984 | 0.580539 | 13.440813 |
| min | 101.000000 | 26.000000 | 410.800000 | 7.000000 | 3.000000 | 9.000000 |
| 25% | 188.250000 | 30.000000 | 502.000000 | 9.000000 | 3.500000 | 15.000000 |
| 50% | 275.500000 | 32.500000 | 775.200000 | 12.000000 | 4.100000 | 23.000000 |
| 75% | 362.750000 | 37.000000 | 1160.600000 | 15.000000 | 4.500000 | 38.000000 |
| max | 450.000000 | 43.000000 | 1520.100000 | 21.000000 | 4.900000 | 63.000000 |
Observation : describe() summarises each numeric column: count, mean, standard deviation, min/max, and quartiles.
df.drop(columns=['Customer_ID'], inplace=True)
df.head()
| | Gender | Age | City | Membership_Type | Total_Spend | Items_Purchased | Average_Rating | Discount_Applied | Days_Since_Last_Purchase | Satisfaction_Level |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 29 | New York | Gold | 1120.20 | 14 | 4.6 | True | 25 | Satisfied |
| 1 | Male | 34 | Los Angeles | Silver | 780.50 | 11 | 4.1 | False | 18 | Neutral |
| 2 | Female | 43 | Chicago | Bronze | 510.75 | 9 | 3.4 | True | 42 | Unsatisfied |
| 3 | Male | 30 | San Francisco | Gold | 1480.30 | 19 | 4.7 | False | 12 | Satisfied |
| 4 | Male | 27 | Miami | Bronze | 720.40 | 13 | 4.0 | True | 55 | Unsatisfied |
Observation : Customer_ID is not used in the analysis, so the column is dropped.
age_bins = [20, 25, 30, 35, 40, 45, float('inf')]
age_labels = ['20-24', '25-29', '30-34', '35-39', '40-44', '45+']
df['AgeBin'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=False)
age_counts = df['AgeBin'].value_counts().sort_index().reset_index()
age_counts.columns = ['Age Group', 'Age Group Count']
plot = pltx.bar(
age_counts,
x='Age Group',
y='Age Group Count',
title="Age Group Distribution of Customers",
text='Age Group Count',
color='Age Group'
)
plot.update_layout(
xaxis_title="Age Group",
yaxis_title="Age Group Count",
xaxis={'categoryorder': 'array', 'categoryarray': age_labels}
)
plot.show()
Observation : The age-group distribution shows that the majority of customers (140) are aged 30-34, followed by 83 customers aged 35-39. There are no customers below 25 or above 45, so all customers fall between 25 and 45.
age_gender_dist = df.groupby(['Gender', 'AgeBin'], observed=True)['Gender'].size().reset_index(name='Count')
fig = pltx.line(
age_gender_dist,
x="AgeBin",
y="Count",
color="Gender",
markers=True,
title="Age Distribution By Gender"
)
fig.update_xaxes(title="Age Group")
fig.update_yaxes(title="Count")
fig.show()
Observation : The majority of male customers are aged 30-34, while female customers peak at around 60 in the 30-34 and 40-44 groups.
df['Age'].describe()
count    350.000000
mean      33.597143
std        4.870882
min       26.000000
25%       30.000000
50%       32.500000
75%       37.000000
max       43.000000
Name: Age, dtype: float64
location_dist = df['City'].value_counts()
fig = pltx.pie(
names=location_dist.index,
values=location_dist.values,
title="Location Distribution",
hole=0.4,
color_discrete_sequence=pltx.colors.qualitative.Set3
)
fig.show()
Observation : Customers come from several cities, and each city accounts for roughly the same share.
items_purchased_gender = df.groupby("Gender")["Items_Purchased"].mean().reset_index()
fig = pltx.bar(
items_purchased_gender,
x="Gender",
y="Items_Purchased",
title="Average Items Purchased by Gender",
color="Gender",
text_auto=True
)
fig.update_xaxes(title="Gender")
fig.update_yaxes(title="Average Items Purchased")
fig.show()
Observation : Male customers buy about 14 items on average, while female customers buy almost 11; male customers clearly purchase more items.
rating_vs_items = df.groupby("Average_Rating")["Items_Purchased"].mean().reset_index()
fig = pltx.bar(
rating_vs_items,
x="Average_Rating",
y="Items_Purchased",
title="Average Items Purchased by Rating",
labels={"Average_Rating": "Average Rating", "Items_Purchased": "Average Items Purchased"},
text_auto=True,
color="Items_Purchased",
color_continuous_scale="Blues"
)
fig.show()
Observation : Customers who purchase more items tend to give higher ratings.
gender_items_avg = df.groupby("Gender")["Items_Purchased"].mean().reset_index()
fig = pltx.pie(
gender_items_avg,
names="Gender",
values="Items_Purchased",
title="Average Items Purchased by Gender",
color_discrete_sequence=pltx.colors.qualitative.Set3
)
fig.show()
Observations : The chart clearly shows that male customers account for 57.3% of items purchased and female customers for 42.7%.
if df["Satisfaction_Level"].dtype == "object":
df["Satisfaction_Level"] = df["Satisfaction_Level"].astype("category").cat.codes
city_vs_satisfaction = df.groupby("City")["Satisfaction_Level"].mean().reset_index()
fig = pltx.pie(
city_vs_satisfaction,
names="City",
values="Satisfaction_Level",
title="Average Satisfaction Level by City",
hole=0.4,
color_discrete_sequence=pltx.colors.qualitative.Set3
)
fig.show()
Observations : Miami has the most satisfied customers, while Houston customers are the most unsatisfied, so Houston deserves particular attention.
city_vs_items = df.groupby("City")["Items_Purchased"].sum().reset_index()
fig = pltx.bar(
city_vs_items,
x="City",
y="Items_Purchased",
title="Total Items Purchased by City",
labels={"City": "City", "Items_Purchased": "Total Items Purchased"},
text_auto=True,
color="Items_Purchased",
color_continuous_scale="Blues"
)
fig.update_xaxes(categoryorder="total descending")
fig.show()
Observations : San Francisco has the highest item sales (1160 items) and Houston the lowest (439), less than half of San Francisco's. Our main cities are San Francisco, New York, and Los Angeles.
gender_membership = df.groupby(["Gender", "Membership_Type"]).size().reset_index(name="Count")
fig = pltx.bar(
gender_membership,
x="Gender",
y="Count",
color="Membership_Type",
title="Gender vs Membership Type Distribution",
barmode="group",
labels={"Count": "Number of Customers", "Gender": "Gender"}
)
fig.show()
Observations : Bronze is the most common membership among female customers. Male and female customers hold Gold memberships in equal numbers, and female Silver members match male Bronze members in count.
gender_avg_rating = df.groupby("Gender")["Average_Rating"].mean().reset_index()
fig = pltx.bar(
gender_avg_rating,
x="Gender",
y="Average_Rating",
title="Average Rating by Gender",
labels={"Average_Rating": "Average Rating", "Gender": "Gender"},
text_auto=True,
color="Gender",
color_discrete_sequence=pltx.colors.qualitative.Set3
)
fig.show()
Observations : Male customers are more satisfied than female customers: the male average rating is 4.30 versus 3.71 for females.
city_membership = df.groupby(["City", "Membership_Type"]).size().reset_index(name="Count")
fig = pltx.bar(
city_membership,
x="City",
y="Count",
color="Membership_Type",
title="Membership Type Distribution by City",
barmode="group",
labels={"Count": "Number of Customers", "City": "City"}
)
fig.show()
Observations : Bronze membership dominates in Chicago and Houston, Silver in Los Angeles and Miami, and Gold in San Francisco and New York.
city_days_purchase = df.groupby("City")["Days_Since_Last_Purchase"].mean().reset_index()
fig = pltx.bar(
city_days_purchase,
x="City",
y="Days_Since_Last_Purchase",
title="Average Days Since Last Purchase by City",
labels={"City": "City", "Days_Since_Last_Purchase": "Avg Days Since Last Purchase"},
color="Days_Since_Last_Purchase",
color_continuous_scale="Blues"
)
fig.update_xaxes(categoryorder="total descending") # Sort cities by avg days since last purchase
fig.show()
Observations : Miami and Chicago customers have been inactive for around 40-42 days on average, while San Francisco customers are the most active compared to other cities.
gender_items_purchased = df.groupby("Gender")["Items_Purchased"].sum().reset_index()
fig = pltx.pie(
gender_items_purchased,
names="Gender",
values="Items_Purchased",
title="Percentage of Items Purchased by Gender",
color_discrete_sequence=pltx.colors.qualitative.Set3
)
fig.show()
Observations : 57.3% of items were purchased by male customers and 42.7% by female customers.
membership_spend = df.groupby("Membership_Type")["Total_Spend"].sum().reset_index()
fig = pltx.bar(
membership_spend,
x="Membership_Type",
y="Total_Spend",
title="Total Spend by Membership Type",
labels={"Membership_Type": "Membership Type", "Total_Spend": "Total Spend"},
text_auto=True,
color="Total_Spend",
color_continuous_scale="Blues"
)
fig.show()
Observations : Gold members spend the most on purchases, while Bronze members spend the least.
df["AgeGroup"] = pd.cut(df["Age"], bins=age_bins, labels=age_labels, right=False)
fig = pltx.violin(
df,
x="Membership_Type",
y="Age",
color="Membership_Type",
box=True,
points="all",
title="Age Group Distribution by Membership Type",
labels={"Membership_Type": "Membership Type", "Age": "Age"}
)
fig.show()
membership_age = df.groupby(["Membership_Type", "Age"]).size().reset_index(name="Count")
fig = pltx.bar(
membership_age,
x="Age",
y="Count",
color="Membership_Type",
title="Distribution of Age by Membership Type",
labels={"Age": "Age", "Count": "Number of Customers"},
barmode="group",
color_discrete_sequence=pltx.colors.qualitative.Set3
)
fig.show()
Observations : Gold membership is most common among customers aged 30, while 28-year-olds hold Bronze and Silver memberships in equal numbers. This chart describes membership by age.
gender_satisfaction = df.groupby(["Gender", "Satisfaction_Level"]).size().reset_index(name="Count")
fig = pltx.bar(
gender_satisfaction,
x="Gender",
y="Count",
color="Satisfaction_Level",
title="Satisfaction Level Distribution by Gender",
labels={"Gender": "Gender", "Count": "Number of Customers", "Satisfaction_Level": "Satisfaction Level"},
barmode="group",
color_discrete_sequence=pltx.colors.qualitative.Set3
)
fig.show()
Observations : The satisfaction level distribution is almost identical for male and female customers.
According to Age:
Average age of consumers: 33.60 years.
Distribution: Consumers are fairly evenly divided across age groups; the largest group is aged 30-34.
Suggestion: Focus on less satisfied consumers; based on the observations, female consumers are less engaged and less satisfied than male consumers.
According to Gender:
Based on the observations, male consumers outnumber female consumers.
Purchasing behaviour: Only minor differences in average spending, items purchased, and discounts applied.
According to Location:
According to Satisfaction:
According to Membership:
According to Long-Term Inactivity:
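The headline numbers in this summary can be recomputed with a simple groupby. This is a minimal sketch run on a few illustrative rows copied from the dataset's head, not the full 350-row file, so the printed figures differ from the full-dataset values quoted above:

```python
import pandas as pd

# A few illustrative rows mirroring the dataset's columns (not the full file)
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Male"],
    "Age": [29, 34, 43, 30, 27],
    "Total_Spend": [1120.20, 780.50, 510.75, 1480.30, 720.40],
    "Items_Purchased": [14, 11, 9, 19, 13],
    "Average_Rating": [4.6, 4.1, 3.4, 4.7, 4.0],
})

# On the full 350-row dataset this evaluates to 33.60 (see df.describe() above)
print(f"Average age: {df['Age'].mean():.2f} years")

# Per-gender means behind the 'According to Gender' bullets
gender_summary = (
    df.groupby("Gender")[["Total_Spend", "Items_Purchased", "Average_Rating"]]
    .mean()
    .round(2)
)
print(gender_summary)
```

Running the same groupby on the real dataframe reproduces the per-gender comparisons shown in the charts above.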
import seaborn as sns
for col in df.select_dtypes(include=['object']).columns:
df[col] = pd.Categorical(df[col]).codes
df_correlation = df.corr(numeric_only=True)
fig = pltx.imshow(
df_correlation,
title="Correlation Matrix Heatmap",
labels=dict(color="Correlation"),
color_continuous_scale="Viridis"
)
fig.show()
Observation : The heatmap shows the pairwise correlations between features.
for col in df.select_dtypes(include=['object']).columns:
df[col] = pd.Categorical(df[col]).codes
df_correlation = df.corr(numeric_only=True)
features = df_correlation["Total_Spend"].sort_values(ascending=False).drop("Total_Spend")
fig = pltx.bar(
x=features.index,
y=features.values,
title="Feature Correlation with Total Spend",
color=features.values,
color_continuous_scale="Cividis"
)
for i in range(len(features)):
fig.add_annotation(
x=features.index[i],
y=features.values[i],
text=f"{features.values[i]:.2f}",
yshift=-10 if features.values[i] < 0 else 10,
showarrow=False,
)
fig.update_layout(xaxis_title="Feature", yaxis_title="Correlation")
fig.show()
for col in df.select_dtypes(include=['object']).columns:
df[col] = pd.Categorical(df[col]).codes
df_correlation = df.corr(numeric_only=True)
features = df_correlation["Satisfaction_Level"].sort_values(ascending=True).drop("Satisfaction_Level")
fig = pltx.bar(
x=features.values,
y=features.index,
title="Feature Correlation with Satisfaction Level",
color=features.values,
orientation="h",
color_continuous_scale="Plasma"
)
for i in range(len(features)):
fig.add_annotation(
x=features.values[i],
y=features.index[i],
text=f"{features.values[i]:.2f}",
xshift=10 if features.values[i] > 0 else -10,
showarrow=False,
)
fig.update_layout(xaxis_title="Correlation", yaxis_title="Feature")
fig.show()
import plotly.graph_objects as go
for col in df.select_dtypes(include=['object']).columns:
df[col] = pd.Categorical(df[col]).codes
df_correlation = df.corr(numeric_only=True)
age_corr = df_correlation["Age"].sort_values(ascending=False).drop("Age")
items_corr = df_correlation["Items_Purchased"].sort_values(ascending=False).drop("Items_Purchased")
fig = go.Figure()
fig.add_trace(go.Bar(
x=age_corr.index,
y=age_corr.values,
name="Correlation with Age",
marker_color="blue"
))
fig.add_trace(go.Bar(
x=items_corr.index,
y=items_corr.values,
name="Correlation with Items Purchased",
marker_color="orange"
))
fig.update_layout(
title="Feature Correlation with Age & Items Purchased",
xaxis_title="Feature",
yaxis_title="Correlation",
barmode="group"
)
fig.show()
from plotly.subplots import make_subplots
features = [
"Total_Spend",
"Items_Purchased",
"Average_Rating",
"Days_Since_Last_Purchase",
]
gender_stats = df.groupby("Gender")[features].describe().reset_index()
fig = make_subplots(
rows=2,
cols=2,
subplot_titles=features,
)
for i, feature in enumerate(features):
fig.add_trace(
go.Bar(
x=gender_stats["Gender"],
y=gender_stats[(feature, "mean")],
name="Mean",
),
row=(i // 2) + 1,
col=(i % 2) + 1,
)
fig.add_trace(
go.Bar(
x=gender_stats["Gender"],
y=gender_stats[(feature, "std")],
name="Std",
),
row=(i // 2) + 1,
col=(i % 2) + 1,
)
fig.update_layout(
title="Customer Statistics by Gender",
showlegend=False,
)
fig.update_yaxes(title_text="Mean", row=1, col=1)
fig.update_yaxes(title_text="Std", row=2, col=1)
fig.show()
features = [
"Total_Spend",
"Items_Purchased",
"Average_Rating",
"Days_Since_Last_Purchase",
]
membership_stats = df.groupby("Membership_Type")[features].describe().reset_index()
fig = make_subplots(
rows=2,
cols=2,
subplot_titles=features,
)
for i, feature in enumerate(features):
fig.add_trace(
go.Bar(
x=membership_stats["Membership_Type"],
y=membership_stats[(feature, "mean")],
name="Mean",
),
row=(i // 2) + 1,
col=(i % 2) + 1,
)
fig.add_trace(
go.Bar(
x=membership_stats["Membership_Type"],
y=membership_stats[(feature, "std")],
name="Std",
),
row=(i // 2) + 1,
col=(i % 2) + 1,
)
fig.update_layout(
title="Customer Statistics by Membership Type",
showlegend=False,
)
fig.update_yaxes(title_text="Mean", row=1, col=1)
fig.update_yaxes(title_text="Std", row=2, col=1)
fig.show()
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
dk = df.drop(columns=['Satisfaction_Level'], errors='ignore')
categorical_features = ['Membership_Type']
numerical_features = ['Total_Spend', 'Items_Purchased', 'Average_Rating', 'Days_Since_Last_Purchase', 'Age']
for col in categorical_features:
dk[col] = pd.Categorical(dk[col]).codes
scaler = StandardScaler()
dk[numerical_features] = scaler.fit_transform(dk[numerical_features])
dk.head()
| | Gender | Age | City | Membership_Type | Total_Spend | Items_Purchased | Average_Rating | Discount_Applied | Days_Since_Last_Purchase | AgeBin | AgeGroup |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | -0.945152 | 4 | 1 | 0.760130 | 0.337346 | 1.001981 | True | -0.118359 | 25-29 | 25-29 |
| 1 | 1 | 0.082826 | 2 | 2 | -0.179459 | -0.385538 | 0.139479 | False | -0.639907 | 30-34 | 30-34 |
| 2 | 0 | 1.933185 | 0 | 0 | -0.925570 | -0.867461 | -1.068024 | True | 1.148256 | 40-44 | 40-44 |
| 3 | 1 | -0.739557 | 5 | 1 | 1.756144 | 1.542153 | 1.174482 | False | -1.086947 | 30-34 | 30-34 |
| 4 | 1 | -1.356343 | 3 | 0 | -0.345692 | 0.096385 | -0.033022 | True | 2.116844 | 25-29 | 25-29 |
df = dff.copy()
dk = df.drop(columns=['Satisfaction_Level'], errors='ignore')
categorical_features = ['Gender', 'Membership_Type', 'City']
numerical_features = ['Total_Spend', 'Items_Purchased', 'Average_Rating', 'Days_Since_Last_Purchase']
if df['Age'].dtype == 'object':
age_mapping = {label: idx for idx, label in enumerate(sorted(df['Age'].unique()))}
dk['Age'] = df['Age'].map(age_mapping)
for col in categorical_features:
dk[col] = pd.Categorical(dk[col]).codes
for col in numerical_features:
    dk[col] = dk[col].fillna(dk[col].mean())
for col in categorical_features:
    dk[col] = dk[col].fillna(dk[col].mode()[0])
dk = dk.apply(pd.to_numeric, errors='coerce')
scaler = StandardScaler()
dk[numerical_features] = scaler.fit_transform(dk[numerical_features])
assert dk.isnull().sum().sum() == 0, "There are still NaN values!"
inertia = []
silhouette_scores = []
k_range = range(2, 8)
for k in k_range:
print(f"Training KMeans with {k} clusters")
k_means = KMeans(n_clusters=k, random_state=42, n_init=10)
k_means.fit(dk)
inertia_ = k_means.inertia_
silhouette_scores_ = silhouette_score(dk, k_means.labels_)
inertia.append(inertia_)
silhouette_scores.append(silhouette_scores_)
print("Inertia:", inertia_)
print("Silhouette Score:", silhouette_scores_)
print("")
Training KMeans with 2 clusters
Inertia: 904304.5455434138
Silhouette Score: 0.6186782097557806

Training KMeans with 3 clusters
Inertia: 408193.6156455684
Silhouette Score: 0.574728029061651

Training KMeans with 4 clusters
Inertia: 234481.4587836321
Silhouette Score: 0.5477332643433899

Training KMeans with 5 clusters
Inertia: 154153.32132675496
Silhouette Score: 0.5276767448751806

Training KMeans with 6 clusters
Inertia: 110387.47225039048
Silhouette Score: 0.5100837311523088

Training KMeans with 7 clusters
Inertia: 84102.52920102132
Silhouette Score: 0.4935405962714986
Standardised the numerical features so each contributes equally to the distance computation.
Evaluated the optimal number of clusters using inertia and silhouette scores.
Chose 3 clusters based on those metrics.
Mapped the clusters to the satisfaction levels (Satisfied, Neutral, Unsatisfied).
Verified that the satisfaction level variable aligns with the cluster output, separating customers into 3 distinct groups.
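The mapping step described above can be sketched as follows. This is a minimal illustration on synthetic, well-separated data (the real notebook clusters the scaled frame `dk`), where each cluster id is assigned the majority satisfaction label found inside it:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled feature matrix: three well-separated blobs
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=-2, scale=0.3, size=(30, 2)),  # e.g. low spenders
    rng.normal(loc=0,  scale=0.3, size=(30, 2)),  # e.g. mid spenders
    rng.normal(loc=2,  scale=0.3, size=(30, 2)),  # e.g. high spenders
])
labels_true = ["Unsatisfied"] * 30 + ["Neutral"] * 30 + ["Satisfied"] * 30

k_means = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = k_means.fit_predict(X)

# Map each cluster id to the majority satisfaction level inside it
mapping = (
    pd.crosstab(clusters, pd.Series(labels_true, name="Satisfaction"))
    .idxmax(axis=1)
    .to_dict()
)
predicted_levels = [mapping[c] for c in clusters]
agreement = np.mean(np.array(predicted_levels) == np.array(labels_true))
print(f"Cluster/label agreement: {agreement:.2f}")
```

On the real data the agreement will be lower than on this toy example; the point is only the mechanics of turning unsupervised cluster ids into satisfaction labels.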
fig = pltx.line(
x=k_range,
y=inertia,
title="Inertia vs Number of Clusters",
labels={"x": "Number of Clusters", "y": "Inertia"},
)
fig.show()
fig = pltx.line(
x=k_range,
y=silhouette_scores,
title="Silhouette Score vs Number of Clusters",
labels={"x": "Number of Clusters", "y": "Silhouette Score"},
)
fig.show()
levels = df["Satisfaction_Level"].value_counts()
fig = pltx.pie(
names=levels.index,
values=levels.values,
title="Satisfaction Level Distribution",
color=levels.index
)
fig.show()
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
X = df.drop(columns=["Satisfaction_Level"], errors='ignore')
y = df["Satisfaction_Level"]
y = LabelEncoder().fit_transform(y)
categorical_features = ["Gender", "Membership_Type", "City"]
numerical_features = ["Total_Spend", "Items_Purchased", "Average_Rating", "Days_Since_Last_Purchase", "Age"]
for col in categorical_features:
X[col] = LabelEncoder().fit_transform(X[col])
scaler = StandardScaler()
X[numerical_features] = scaler.fit_transform(X[numerical_features])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
models = {
"Logistic Regression": LogisticRegression(),
"Random Forest": RandomForestClassifier(),
"Gradient Boosting": GradientBoostingClassifier(),
"SVM": SVC(),
"Naive Bayes": GaussianNB(),
"KNN": KNeighborsClassifier(),
}
print(f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}, y_train shape: {y_train.shape}, y_test shape: {y_test.shape}")
X_train shape: (280, 10), X_test shape: (70, 10), y_train shape: (280,), y_test shape: (70,)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
X = df.drop(columns=["Satisfaction_Level"], errors='ignore')
y = df["Satisfaction_Level"]
categorical_features = ["Gender", "Membership_Type", "City"]
numerical_features = ["Total_Spend", "Items_Purchased", "Average_Rating", "Days_Since_Last_Purchase", "Age"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
preprocessor = ColumnTransformer(
transformers=[
("standard_scaler", StandardScaler(), numerical_features),
("one_hot_encoder", OneHotEncoder(handle_unknown='ignore'), categorical_features),
]
).fit(X_train)
X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)
print("Preprocessing Completed")
Preprocessing Completed
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
import time
X = df.drop(columns=["Satisfaction_Level"], errors='ignore')
y = df["Satisfaction_Level"]
categorical_features = ["Gender", "Membership_Type", "City"]
numerical_features = ["Total_Spend", "Items_Purchased", "Average_Rating", "Days_Since_Last_Purchase", "Age"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
preprocessor = ColumnTransformer(
transformers=[
("standard_scaler", StandardScaler(), numerical_features),
("one_hot_encoder", OneHotEncoder(handle_unknown='ignore'), categorical_features),
]
).fit(X_train)
X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)
models = {
"Logistic Regression": LogisticRegression(),
"Random Forest": RandomForestClassifier(),
"Gradient Boosting": GradientBoostingClassifier(),
"SVM": SVC(),
"Naive Bayes": GaussianNB(),
"KNN": KNeighborsClassifier(),
}
for model_name, model in models.items():
print(f"Training {model_name}")
start_time = time.time()
model.fit(X_train, y_train)
end_time = time.time()
print(f"Training {model_name} complete time taken: {end_time - start_time:.2f} seconds\n")
Training Logistic Regression
Training Logistic Regression complete time taken: 0.03 seconds

Training Random Forest
Training Random Forest complete time taken: 0.23 seconds

Training Gradient Boosting
Training Gradient Boosting complete time taken: 0.33 seconds

Training SVM
Training SVM complete time taken: 0.00 seconds

Training Naive Bayes
Training Naive Bayes complete time taken: 0.00 seconds

Training KNN
Training KNN complete time taken: 0.00 seconds
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, mean_absolute_error
X = df.drop(columns=["Satisfaction_Level"], errors='ignore')
y = LabelEncoder().fit_transform(df["Satisfaction_Level"])
categorical_features = ["Gender", "Membership_Type", "City"]
numerical_features = ["Total_Spend", "Items_Purchased", "Average_Rating", "Days_Since_Last_Purchase", "Age"]
preprocessor = ColumnTransformer([
("scale", StandardScaler(), numerical_features),
("encode", OneHotEncoder(handle_unknown='ignore'), categorical_features)
])
X_train, X_test, y_train, y_test = train_test_split(preprocessor.fit_transform(X), y, test_size=0.2, random_state=42)
models = {
"Logistic Regression": LogisticRegression(),
"Random Forest": RandomForestClassifier(),
"Gradient Boosting": GradientBoostingClassifier(),
"SVM": SVC(),
"Naive Bayes": GaussianNB(),
"KNN": KNeighborsClassifier(),
}
results = {}
for name, model in models.items():
start = time.time()
model.fit(X_train, y_train)
y_pred = model.predict(X_train)  # note: metrics below are computed on the training set, so they overstate generalisation
results[name] = {
"Accuracy": accuracy_score(y_train, y_pred),
"F1": f1_score(y_train, y_pred, average='weighted'),
"Recall": recall_score(y_train, y_pred, average='weighted'),
"Precision": precision_score(y_train, y_pred, average='weighted'),
"MAE": mean_absolute_error(y_train, y_pred),
"Time": round(time.time() - start, 2),
}
results_df = pd.DataFrame(results).T
fig = pltx.bar(results_df, x=results_df.index, y=["Accuracy", "F1", "Recall", "Precision", "MAE"], title="Model Performance", barmode='group')
fig.show()
fig = pltx.bar(
results_df,
x=results_df.index,
y="Time",
title="Model Training Time",
labels={"Time": "Training Time (seconds)", "index": "Models"},
color="Time",
color_continuous_scale="Blues",
)
fig.show()
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
for name, model in models.items():
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap="Blues", xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
plt.title(f"Confusion Matrix - {name}")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
rf_param_grid = {
"n_estimators": [50, 100, 200],
"max_depth": [None, 10, 20, 30],
"min_samples_split": [2, 5, 10],
"min_samples_leaf": [1, 2, 4],
"bootstrap": [True, False],
}
rf_model = RandomForestClassifier(random_state=42)
start_time = time.time()
rf_tuner = RandomizedSearchCV(
rf_model, param_distributions=rf_param_grid,
n_iter=20, cv=5, random_state=42, n_jobs=-1
)
rf_tuner.fit(X_train, y_train)
y_pred_rf_tuned = rf_tuner.predict(X_test)
tuning_duration = time.time() - start_time
best_rf_params = rf_tuner.best_params_
best_rf_score = rf_tuner.best_score_
print(f"Best Parameters: {best_rf_params}")
print(f"Best Score: {best_rf_score}")
print(f"Tuning Time: {tuning_duration:.2f} seconds")
Best Parameters: {'n_estimators': 50, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_depth': 30, 'bootstrap': False}
Best Score: 0.9928571428571429
Tuning Time: 7.68 seconds
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
rf_original = RandomForestClassifier(random_state=42)
rf_original.fit(X_train, y_train)
y_pred_original = rf_original.predict(X_test)
y_pred_tuned = rf_tuner.predict(X_test)
def evaluate_model(y_true, y_pred):
return {
"Accuracy": accuracy_score(y_true, y_pred),
"F1 Score": f1_score(y_true, y_pred, average="weighted"),
"Recall": recall_score(y_true, y_pred, average="weighted"),
"Precision": precision_score(y_true, y_pred, average="weighted"),
}
original_rf_metrics = evaluate_model(y_test, y_pred_original)
tuned_rf_metrics = evaluate_model(y_test, y_pred_tuned)
performance_df = pd.DataFrame({"Original RF": original_rf_metrics, "Tuned RF": tuned_rf_metrics})
print("Model Performance Comparison (Original vs Tuned RandomForest):")
display(performance_df)
import plotly.express as px
fig = px.bar(
performance_df.T,
barmode="group",
title="Performance Comparison: Original vs Tuned RandomForest",
)
fig.show()
Model Performance Comparison (Original vs Tuned RandomForest):
| | Original RF | Tuned RF |
|---|---|---|
| Accuracy | 1.0 | 1.0 |
| F1 Score | 1.0 | 1.0 |
| Recall | 1.0 | 1.0 |
| Precision | 1.0 | 1.0 |
for param, value in best_rf_params.items():
print(f"✅ {param}: {value}")
✅ n_estimators: 50
✅ min_samples_split: 5
✅ min_samples_leaf: 2
✅ max_depth: 30
✅ bootstrap: False
feature_importances = rf_tuner.best_estimator_.feature_importances_
feature_names = preprocessor.get_feature_names_out()
feature_importance_df = pd.DataFrame(
{"Feature": feature_names, "Importance": feature_importances}
).sort_values(by="Importance", ascending=False)
fig = px.bar(
feature_importance_df,
x="Feature",
y="Importance",
title="Feature Importance (Tuned RandomForest)",
text=feature_importance_df["Importance"].apply(lambda x: f"{x:.4f}"),
color="Importance",
color_continuous_scale="Blues"
)
fig.show()
My project focuses on predicting consumer satisfaction using machine learning classification techniques. Based on the dataset of 350 consumers, we identified the key factors behind the satisfaction levels (Satisfied, Neutral, Unsatisfied) and developed a robust classification model for prediction.
The Random Forest classifier reached 100% accuracy after hyperparameter tuning, effectively distinguishing between the satisfaction levels based on customer behaviour, purchases, and days since last purchase. The best model performs consistently across all three categories, with reliable predictions.
This customer satisfaction prediction model gives businesses a powerful tool to understand and enhance the consumer experience. By observing consumer purchase patterns and satisfaction levels, a company can make advanced, data-driven decisions to improve retention, target marketing more precisely, strengthen relationships with consumers, and offer appealing discounts.
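As a sketch of how such a model could be used in practice, the snippet below wraps the same preprocessing and classifier into a single scikit-learn Pipeline and scores one hypothetical new customer. It trains on a few illustrative rows mirroring this notebook's columns, not the real dataset, so the predicted label is for demonstration only:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative training frame mirroring the notebook's columns
train = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Age": [29, 34, 43, 30],
    "City": ["New York", "Los Angeles", "Chicago", "San Francisco"],
    "Membership_Type": ["Gold", "Silver", "Bronze", "Gold"],
    "Total_Spend": [1120.2, 780.5, 510.75, 1480.3],
    "Items_Purchased": [14, 11, 9, 19],
    "Average_Rating": [4.6, 4.1, 3.4, 4.7],
    "Days_Since_Last_Purchase": [25, 18, 42, 12],
    "Satisfaction_Level": ["Satisfied", "Neutral", "Unsatisfied", "Satisfied"],
})

numerical = ["Total_Spend", "Items_Purchased", "Average_Rating",
             "Days_Since_Last_Purchase", "Age"]
categorical = ["Gender", "Membership_Type", "City"]

# Same preprocessing as in the notebook, bundled with the classifier
pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        ("scale", StandardScaler(), numerical),
        ("encode", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])),
    ("model", RandomForestClassifier(random_state=42)),
])
pipeline.fit(train.drop(columns=["Satisfaction_Level"]), train["Satisfaction_Level"])

# Score a hypothetical new customer (all values invented for illustration)
new_customer = pd.DataFrame([{
    "Gender": "Female", "Age": 31, "City": "Miami", "Membership_Type": "Gold",
    "Total_Spend": 1300.0, "Items_Purchased": 16, "Average_Rating": 4.5,
    "Days_Since_Last_Purchase": 10,
}])
print(pipeline.predict(new_customer)[0])
```

Bundling the ColumnTransformer and the model in one Pipeline means the same scaling and encoding fitted on the training data are applied automatically at prediction time, which avoids train/serve skew.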